Assignment-1

Author

Maria Oljaca

The primary question you will answer is whether daily concentrations of PM2.5 (particulate matter air pollution with aerodynamic diameter less than 2.5 \(\mu\)m) have decreased in California over the last 20 years (from 2002 to 2022).

Step 1

library(data.table)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::between()     masks data.table::between()
✖ dplyr::filter()      masks stats::filter()
✖ dplyr::first()       masks data.table::first()
✖ lubridate::hour()    masks data.table::hour()
✖ lubridate::isoweek() masks data.table::isoweek()
✖ dplyr::lag()         masks stats::lag()
✖ dplyr::last()        masks data.table::last()
✖ lubridate::mday()    masks data.table::mday()
✖ lubridate::minute()  masks data.table::minute()
✖ lubridate::month()   masks data.table::month()
✖ lubridate::quarter() masks data.table::quarter()
✖ lubridate::second()  masks data.table::second()
✖ purrr::transpose()   masks data.table::transpose()
✖ lubridate::wday()    masks data.table::wday()
✖ lubridate::week()    masks data.table::week()
✖ lubridate::yday()    masks data.table::yday()
✖ lubridate::year()    masks data.table::year()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data_02 <- fread("C:/Users/molja/Downloads/ad_viz_plotval_data.csv")
data_22 <- fread("C:/Users/molja/Downloads/ad_viz_plotval_data (1).csv")
dim(data_02)
[1] 15976    20
dim(data_22)
[1] 56140    20
head(data_02)
         Date Source  Site ID POC Daily Mean PM2.5 Concentration    UNITS
1: 01/05/2002    AQS 60010007   1                           25.1 ug/m3 LC
2: 01/06/2002    AQS 60010007   1                           31.6 ug/m3 LC
3: 01/08/2002    AQS 60010007   1                           21.4 ug/m3 LC
4: 01/11/2002    AQS 60010007   1                           25.9 ug/m3 LC
5: 01/14/2002    AQS 60010007   1                           34.5 ug/m3 LC
6: 01/17/2002    AQS 60010007   1                           41.0 ug/m3 LC
   DAILY_AQI_VALUE Site Name DAILY_OBS_COUNT PERCENT_COMPLETE
1:              78 Livermore               1              100
2:              92 Livermore               1              100
3:              71 Livermore               1              100
4:              80 Livermore               1              100
5:              98 Livermore               1              100
6:             115 Livermore               1              100
   AQS_PARAMETER_CODE       AQS_PARAMETER_DESC CBSA_CODE
1:              88101 PM2.5 - Local Conditions     41860
2:              88101 PM2.5 - Local Conditions     41860
3:              88101 PM2.5 - Local Conditions     41860
4:              88101 PM2.5 - Local Conditions     41860
5:              88101 PM2.5 - Local Conditions     41860
6:              88101 PM2.5 - Local Conditions     41860
                           CBSA_NAME STATE_CODE      STATE COUNTY_CODE  COUNTY
1: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
2: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
3: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
4: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
5: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
6: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
   SITE_LATITUDE SITE_LONGITUDE
1:      37.68753      -121.7842
2:      37.68753      -121.7842
3:      37.68753      -121.7842
4:      37.68753      -121.7842
5:      37.68753      -121.7842
6:      37.68753      -121.7842
tail(data_02)
         Date Source  Site ID POC Daily Mean PM2.5 Concentration    UNITS
1: 12/10/2002    AQS 61131003   1                             15 ug/m3 LC
2: 12/13/2002    AQS 61131003   1                             15 ug/m3 LC
3: 12/22/2002    AQS 61131003   1                              1 ug/m3 LC
4: 12/25/2002    AQS 61131003   1                             23 ug/m3 LC
5: 12/28/2002    AQS 61131003   1                              5 ug/m3 LC
6: 12/31/2002    AQS 61131003   1                              6 ug/m3 LC
   DAILY_AQI_VALUE            Site Name DAILY_OBS_COUNT PERCENT_COMPLETE
1:              57 Woodland-Gibson Road               1              100
2:              57 Woodland-Gibson Road               1              100
3:               4 Woodland-Gibson Road               1              100
4:              74 Woodland-Gibson Road               1              100
5:              21 Woodland-Gibson Road               1              100
6:              25 Woodland-Gibson Road               1              100
   AQS_PARAMETER_CODE       AQS_PARAMETER_DESC CBSA_CODE
1:              88101 PM2.5 - Local Conditions     40900
2:              88101 PM2.5 - Local Conditions     40900
3:              88101 PM2.5 - Local Conditions     40900
4:              88101 PM2.5 - Local Conditions     40900
5:              88101 PM2.5 - Local Conditions     40900
6:              88101 PM2.5 - Local Conditions     40900
                                 CBSA_NAME STATE_CODE      STATE COUNTY_CODE
1: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
2: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
3: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
4: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
5: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
6: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
   COUNTY SITE_LATITUDE SITE_LONGITUDE
1:   Yolo      38.66121      -121.7327
2:   Yolo      38.66121      -121.7327
3:   Yolo      38.66121      -121.7327
4:   Yolo      38.66121      -121.7327
5:   Yolo      38.66121      -121.7327
6:   Yolo      38.66121      -121.7327
head(data_22)
         Date Source  Site ID POC Daily Mean PM2.5 Concentration    UNITS
1: 01/01/2022    AQS 60010007   3                           12.7 ug/m3 LC
2: 01/02/2022    AQS 60010007   3                           13.9 ug/m3 LC
3: 01/03/2022    AQS 60010007   3                            7.1 ug/m3 LC
4: 01/04/2022    AQS 60010007   3                            3.7 ug/m3 LC
5: 01/05/2022    AQS 60010007   3                            4.2 ug/m3 LC
6: 01/06/2022    AQS 60010007   3                            3.8 ug/m3 LC
   DAILY_AQI_VALUE Site Name DAILY_OBS_COUNT PERCENT_COMPLETE
1:              52 Livermore               1              100
2:              55 Livermore               1              100
3:              30 Livermore               1              100
4:              15 Livermore               1              100
5:              18 Livermore               1              100
6:              16 Livermore               1              100
   AQS_PARAMETER_CODE       AQS_PARAMETER_DESC CBSA_CODE
1:              88101 PM2.5 - Local Conditions     41860
2:              88101 PM2.5 - Local Conditions     41860
3:              88101 PM2.5 - Local Conditions     41860
4:              88101 PM2.5 - Local Conditions     41860
5:              88101 PM2.5 - Local Conditions     41860
6:              88101 PM2.5 - Local Conditions     41860
                           CBSA_NAME STATE_CODE      STATE COUNTY_CODE  COUNTY
1: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
2: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
3: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
4: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
5: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
6: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
   SITE_LATITUDE SITE_LONGITUDE
1:      37.68753      -121.7842
2:      37.68753      -121.7842
3:      37.68753      -121.7842
4:      37.68753      -121.7842
5:      37.68753      -121.7842
6:      37.68753      -121.7842
tail(data_22)
         Date Source  Site ID POC Daily Mean PM2.5 Concentration    UNITS
1: 12/01/2022    AQS 61131003   1                            3.4 ug/m3 LC
2: 12/07/2022    AQS 61131003   1                            3.8 ug/m3 LC
3: 12/13/2022    AQS 61131003   1                            6.0 ug/m3 LC
4: 12/19/2022    AQS 61131003   1                           34.8 ug/m3 LC
5: 12/25/2022    AQS 61131003   1                           23.2 ug/m3 LC
6: 12/31/2022    AQS 61131003   1                            1.0 ug/m3 LC
   DAILY_AQI_VALUE            Site Name DAILY_OBS_COUNT PERCENT_COMPLETE
1:              14 Woodland-Gibson Road               1              100
2:              16 Woodland-Gibson Road               1              100
3:              25 Woodland-Gibson Road               1              100
4:              99 Woodland-Gibson Road               1              100
5:              74 Woodland-Gibson Road               1              100
6:               4 Woodland-Gibson Road               1              100
   AQS_PARAMETER_CODE       AQS_PARAMETER_DESC CBSA_CODE
1:              88101 PM2.5 - Local Conditions     40900
2:              88101 PM2.5 - Local Conditions     40900
3:              88101 PM2.5 - Local Conditions     40900
4:              88101 PM2.5 - Local Conditions     40900
5:              88101 PM2.5 - Local Conditions     40900
6:              88101 PM2.5 - Local Conditions     40900
                                 CBSA_NAME STATE_CODE      STATE COUNTY_CODE
1: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
2: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
3: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
4: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
5: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
6: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
   COUNTY SITE_LATITUDE SITE_LONGITUDE
1:   Yolo      38.66121      -121.7327
2:   Yolo      38.66121      -121.7327
3:   Yolo      38.66121      -121.7327
4:   Yolo      38.66121      -121.7327
5:   Yolo      38.66121      -121.7327
6:   Yolo      38.66121      -121.7327
str(data_02)
Classes 'data.table' and 'data.frame':  15976 obs. of  20 variables:
 $ Date                          : chr  "01/05/2002" "01/06/2002" "01/08/2002" "01/11/2002" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Daily Mean PM2.5 Concentration: num  25.1 31.6 21.4 25.9 34.5 41 29.3 15 18.8 37.9 ...
 $ UNITS                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ DAILY_AQI_VALUE               : int  78 92 71 80 98 115 87 57 65 107 ...
 $ Site Name                     : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ DAILY_OBS_COUNT               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ PERCENT_COMPLETE              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS_PARAMETER_CODE            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS_PARAMETER_DESC            : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ CBSA_CODE                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA_NAME                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ STATE_CODE                    : int  6 6 6 6 6 6 6 6 6 6 ...
 $ STATE                         : chr  "California" "California" "California" "California" ...
 $ COUNTY_CODE                   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ COUNTY                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ SITE_LATITUDE                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ SITE_LONGITUDE                : num  -122 -122 -122 -122 -122 ...
 - attr(*, ".internal.selfref")=<externalptr> 
str(data_22)
Classes 'data.table' and 'data.frame':  56140 obs. of  20 variables:
 $ Date                          : chr  "01/01/2022" "01/02/2022" "01/03/2022" "01/04/2022" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  3 3 3 3 3 3 3 3 3 3 ...
 $ Daily Mean PM2.5 Concentration: num  12.7 13.9 7.1 3.7 4.2 3.8 2.3 6.9 13.6 11.2 ...
 $ UNITS                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ DAILY_AQI_VALUE               : int  52 55 30 15 18 16 10 29 54 47 ...
 $ Site Name                     : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ DAILY_OBS_COUNT               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ PERCENT_COMPLETE              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS_PARAMETER_CODE            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS_PARAMETER_DESC            : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ CBSA_CODE                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA_NAME                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ STATE_CODE                    : int  6 6 6 6 6 6 6 6 6 6 ...
 $ STATE                         : chr  "California" "California" "California" "California" ...
 $ COUNTY_CODE                   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ COUNTY                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ SITE_LATITUDE                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ SITE_LONGITUDE                : num  -122 -122 -122 -122 -122 ...
 - attr(*, ".internal.selfref")=<externalptr> 
summary(data_02)
     Date              Source             Site ID              POC       
 Length:15976       Length:15976       Min.   :60010007   Min.   :1.000  
 Class :character   Class :character   1st Qu.:60290014   1st Qu.:1.000  
 Mode  :character   Mode  :character   Median :60590007   Median :1.000  
                                       Mean   :60549600   Mean   :1.581  
                                       3rd Qu.:60731002   3rd Qu.:1.000  
                                       Max.   :61131003   Max.   :6.000  
                                                                         
 Daily Mean PM2.5 Concentration    UNITS           DAILY_AQI_VALUE 
 Min.   :  0.00                 Length:15976       Min.   :  0.00  
 1st Qu.:  7.00                 Class :character   1st Qu.: 29.00  
 Median : 12.00                 Mode  :character   Median : 50.00  
 Mean   : 16.12                                    Mean   : 53.68  
 3rd Qu.: 20.50                                    3rd Qu.: 69.00  
 Max.   :104.30                                    Max.   :176.00  
                                                                   
  Site Name         DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
 Length:15976       Min.   :1       Min.   :100      Min.   :88101     
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
 Mode  :character   Median :1       Median :100      Median :88101     
                    Mean   :1       Mean   :100      Mean   :88215     
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88502     
                    Max.   :1       Max.   :100      Max.   :88502     
                                                                       
 AQS_PARAMETER_DESC   CBSA_CODE      CBSA_NAME           STATE_CODE
 Length:15976       Min.   :12540   Length:15976       Min.   :6   
 Class :character   1st Qu.:23420   Class :character   1st Qu.:6   
 Mode  :character   Median :40140   Mode  :character   Median :6   
                    Mean   :33270                      Mean   :6   
                    3rd Qu.:41740                      3rd Qu.:6   
                    Max.   :49700                      Max.   :6   
                    NA's   :929                                    
    STATE            COUNTY_CODE        COUNTY          SITE_LATITUDE  
 Length:15976       Min.   :  1.00   Length:15976       Min.   :32.63  
 Class :character   1st Qu.: 29.00   Class :character   1st Qu.:34.07  
 Mode  :character   Median : 59.00   Mode  :character   Median :35.36  
                    Mean   : 54.78                      Mean   :36.00  
                    3rd Qu.: 73.00                      3rd Qu.:37.77  
                    Max.   :113.00                      Max.   :41.71  
                                                                       
 SITE_LONGITUDE  
 Min.   :-124.2  
 1st Qu.:-121.4  
 Median :-119.1  
 Mean   :-119.4  
 3rd Qu.:-117.9  
 Max.   :-115.5  
                 
summary(data_22)
     Date              Source             Site ID              POC        
 Length:56140       Length:56140       Min.   :60010007   Min.   : 1.000  
 Class :character   Class :character   1st Qu.:60310004   1st Qu.: 1.000  
 Mode  :character   Mode  :character   Median :60631006   Median : 3.000  
                                       Mean   :60567850   Mean   : 2.549  
                                       3rd Qu.:60750005   3rd Qu.: 3.000  
                                       Max.   :61131003   Max.   :21.000  
                                                                          
 Daily Mean PM2.5 Concentration    UNITS           DAILY_AQI_VALUE 
 Min.   : -2.20                 Length:56140       Min.   :  0.00  
 1st Qu.:  4.20                 Class :character   1st Qu.: 18.00  
 Median :  6.90                 Mode  :character   Median : 29.00  
 Mean   :  8.52                                    Mean   : 32.84  
 3rd Qu.: 10.80                                    3rd Qu.: 45.00  
 Max.   :302.50                                    Max.   :353.00  
                                                                   
  Site Name         DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
 Length:56140       Min.   :1       Min.   :100      Min.   :88101     
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
 Mode  :character   Median :1       Median :100      Median :88101     
                    Mean   :1       Mean   :100      Mean   :88197     
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88101     
                    Max.   :1       Max.   :100      Max.   :88502     
                                                                       
 AQS_PARAMETER_DESC   CBSA_CODE      CBSA_NAME           STATE_CODE
 Length:56140       Min.   :12540   Length:56140       Min.   :6   
 Class :character   1st Qu.:31080   Class :character   1st Qu.:6   
 Mode  :character   Median :40140   Mode  :character   Median :6   
                    Mean   :35340                      Mean   :6   
                    3rd Qu.:41860                      3rd Qu.:6   
                    Max.   :49700                      Max.   :6   
                    NA's   :4199                                   
    STATE            COUNTY_CODE        COUNTY          SITE_LATITUDE  
 Length:56140       Min.   :  1.00   Length:56140       Min.   :32.58  
 Class :character   1st Qu.: 31.00   Class :character   1st Qu.:34.14  
 Mode  :character   Median : 63.00   Mode  :character   Median :36.50  
                    Mean   : 56.64                      Mean   :36.33  
                    3rd Qu.: 75.00                      3rd Qu.:37.97  
                    Max.   :113.00                      Max.   :41.76  
                                                                       
 SITE_LONGITUDE  
 Min.   :-124.2  
 1st Qu.:-121.5  
 Median :-119.7  
 Mean   :-119.7  
 3rd Qu.:-118.1  
 Max.   :-115.5  
                 
sum(is.na(data_02))
[1] 929
sum(is.na(data_22))
[1] 4199

The 2002 data has 15976 rows (observations) and 20 columns (variables). The 2022 data has 56140 rows (observations) and 20 columns (variables). There were 929 NAs in the 2002 data and 4199 NAs in the 2022 data, all of which were in the “CBSA_CODE” column.
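To see which columns the missing values sit in, a per-column count is more direct than the overall total. The sketch below uses a small made-up table (`toy` and its values are illustrative, not the EPA data); the same `colSums(is.na(...))` call could be applied to `data_02` and `data_22`.

```r
# Toy table standing in for the EPA download (all values are made up)
toy <- data.frame(
  PM2.5     = c(12.1, 8.4, 30.2),
  CBSA_CODE = c(41860, NA, NA),
  COUNTY    = c("Alameda", "Inyo", "Mono")
)

# Number of missing values in each column
na_by_column <- colSums(is.na(toy))
na_by_column

# Names of the columns that contain at least one NA
names(na_by_column)[na_by_column > 0]
```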

When checking our data, we see that daily concentrations of PM2.5 in the 2022 data have a minimum of -2.2. It doesn’t make sense for a concentration to be less than zero, so we remove the values below zero.

data_22 <- data_22[data_22$`Daily Mean PM2.5 Concentration` >= 0, ]
summary(data_22$`Daily Mean PM2.5 Concentration`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   4.200   7.000   8.554  10.800 302.500 
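Before dropping rows it can help to record how many readings are affected. A minimal sketch on a made-up vector (`pm` is hypothetical; with the real data one would use the `Daily Mean PM2.5 Concentration` column of `data_22`):

```r
# Toy concentrations (made up); two of the five values are below zero
pm <- c(-2.2, 0, 4.2, -0.5, 10.8)

# Count of negative readings and their share of all readings
n_negative <- sum(pm < 0)
share_negative <- mean(pm < 0)

# Keep only the non-negative readings, mirroring the filter used above
pm_clean <- pm[pm >= 0]
summary(pm_clean)
```

Reporting the count alongside the filter makes it clear how much data the cleaning step removes.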

Step 2: Combine Data

combined_data <- rbindlist(list(
data_02[, year := 2002],
data_22[, year := 2022]))
setnames(combined_data, c("Daily Mean PM2.5 Concentration", "SITE_LATITUDE", "SITE_LONGITUDE"), c("PM2.5", "lat", "lon"))
setnames(data_02, c("Daily Mean PM2.5 Concentration", "SITE_LATITUDE", "SITE_LONGITUDE"), c("PM2.5", "lat", "lon"))
setnames(data_22, c("Daily Mean PM2.5 Concentration", "SITE_LATITUDE", "SITE_LONGITUDE"), c("PM2.5", "lat", "lon"))
head(combined_data)
         Date Source  Site ID POC PM2.5    UNITS DAILY_AQI_VALUE Site Name
1: 01/05/2002    AQS 60010007   1  25.1 ug/m3 LC              78 Livermore
2: 01/06/2002    AQS 60010007   1  31.6 ug/m3 LC              92 Livermore
3: 01/08/2002    AQS 60010007   1  21.4 ug/m3 LC              71 Livermore
4: 01/11/2002    AQS 60010007   1  25.9 ug/m3 LC              80 Livermore
5: 01/14/2002    AQS 60010007   1  34.5 ug/m3 LC              98 Livermore
6: 01/17/2002    AQS 60010007   1  41.0 ug/m3 LC             115 Livermore
   DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE       AQS_PARAMETER_DESC
1:               1              100              88101 PM2.5 - Local Conditions
2:               1              100              88101 PM2.5 - Local Conditions
3:               1              100              88101 PM2.5 - Local Conditions
4:               1              100              88101 PM2.5 - Local Conditions
5:               1              100              88101 PM2.5 - Local Conditions
6:               1              100              88101 PM2.5 - Local Conditions
   CBSA_CODE                         CBSA_NAME STATE_CODE      STATE
1:     41860 San Francisco-Oakland-Hayward, CA          6 California
2:     41860 San Francisco-Oakland-Hayward, CA          6 California
3:     41860 San Francisco-Oakland-Hayward, CA          6 California
4:     41860 San Francisco-Oakland-Hayward, CA          6 California
5:     41860 San Francisco-Oakland-Hayward, CA          6 California
6:     41860 San Francisco-Oakland-Hayward, CA          6 California
   COUNTY_CODE  COUNTY      lat       lon year
1:           1 Alameda 37.68753 -121.7842 2002
2:           1 Alameda 37.68753 -121.7842 2002
3:           1 Alameda 37.68753 -121.7842 2002
4:           1 Alameda 37.68753 -121.7842 2002
5:           1 Alameda 37.68753 -121.7842 2002
6:           1 Alameda 37.68753 -121.7842 2002

Step 3: Basic Map

library(leaflet)
leaflet(combined_data) %>%
  addProviderTiles('CartoDB.Positron') %>%
  addCircleMarkers(
    lat = ~lat,
    lng = ~lon,
    color = ~ifelse(year == 2002, "red", "yellow"),
    weight = 2,
    opacity = 0.1,
    radius = 0.001
  ) %>%
  addLegend(
    "bottomleft",
    colors = c("red", "yellow"),
    labels = c("2002 Sites", "2022 Sites")
    )

Overall, there appear to be many more sites in 2022 than in 2002. Sites are spread across California except in the southeast, and many are concentrated around bigger cities, such as Los Angeles.
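The impression from the map can be checked by counting distinct monitoring sites in each year. This is a sketch on made-up data (`toy_sites` and its `site_id` column are hypothetical); on the real table the same idea would apply to the `Site ID` and `year` columns of `combined_data`.

```r
# Toy site records (made up): repeated rows represent repeated daily readings
toy_sites <- data.frame(
  year    = c(2002, 2002, 2002, 2022, 2022, 2022, 2022),
  site_id = c(1, 1, 2, 1, 2, 3, 3)
)

# Number of distinct sites observed in each year
sites_per_year <- tapply(
  toy_sites$site_id,
  toy_sites$year,
  function(ids) length(unique(ids))
)
sites_per_year
```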

Step 4:

sum(is.na(combined_data$PM2.5))
[1] 0
sum(combined_data$PM2.5 < 0)
[1] 0
summary(combined_data$PM2.5)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.60    7.70   10.23   12.40  302.50 

It appears that there are no missing or implausible values in the combined data set.

Step 5: Daily concentrations of PM2.5 in CA at three different spatial levels in 2002 and 2022

State Level:

combined_data %>%
  group_by(year) %>%
  summarize(
    Min_PM2.5 = min(PM2.5, na.rm = TRUE),
    Q1_PM2.5 = quantile(PM2.5, 0.25, na.rm = TRUE),
    Mean_PM2.5 = mean(PM2.5, na.rm = TRUE),
    Median_PM2.5 = median(PM2.5, na.rm = TRUE),
    Q3_PM2.5 = quantile(PM2.5, 0.75, na.rm = TRUE),
    Max_PM2.5 = max(PM2.5, na.rm = TRUE)
  )
# A tibble: 2 × 7
   year Min_PM2.5 Q1_PM2.5 Mean_PM2.5 Median_PM2.5 Q3_PM2.5 Max_PM2.5
  <dbl>     <dbl>    <dbl>      <dbl>        <dbl>    <dbl>     <dbl>
1  2002         0      7        16.1            12     20.5      104.
2  2022         0      4.2       8.55            7     10.8      302.
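combined_data %>%
  # Parse the character dates so the x-axis is chronological rather than alphabetical
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  ggplot(aes(x = Date, y = PM2.5)) +
  geom_line() +
  labs(title = "PM2.5 Concentration", x = "Date", y = "PM2.5 Concentration") +
  # Each panel covers a different calendar year, so let the x-axis vary by panel
  facet_wrap(~ year, ncol = 2, scales = "free_x")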

average_pm_by_year <- combined_data %>%
group_by(year) %>%
summarize(
Average_PM = mean(PM2.5, na.rm = TRUE),
SD_PM = sd(PM2.5, na.rm = TRUE)
)
ggplot(average_pm_by_year, aes(x = as.factor(year), y = Average_PM)) +
geom_bar(stat = "identity", fill = "darkgreen") +
geom_errorbar(
aes(ymin = Average_PM - SD_PM, ymax = Average_PM + SD_PM),
width = 0.2,
position = position_dodge(width = 0.9)) +
labs(title = "Average PM2.5 Level in California by Year (2002-2022)", x = "Year", y = "Average PM2.5 Level")

t_test_state <- t.test(data_02$PM2.5, data_22$PM2.5)
t_test_state

    Welch Two Sample t-test

data:  data_02$PM2.5 and data_22$PM2.5
t = 66.136, df = 18805, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 7.337922 7.786156
sample estimates:
mean of x mean of y 
16.115943  8.553904 

Looking at the summary statistics by year and the bar plot, we can see that the average daily PM2.5 concentration is lower in 2022 than in 2002. However, 2022 has a higher maximum daily PM2.5 concentration (302.5 vs 104.3 in 2002). We can see this maximum as an apparent spike in the 2022 time series plot, which looks to be approximately in the late summer to early fall. The t-test shows that the difference in average daily PM2.5 concentration between the two years is statistically significant (p < 0.001).
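For reference, the Welch statistic reported by `t.test()` can be reproduced by hand. The sketch below uses only the six Livermore values printed by `head(data_02)` and `head(data_22)` above, purely as an illustration; it is not a re-analysis of the full data, and the variable names are made up for this example.

```r
# First six daily values shown by head() for each year (illustration only)
x <- c(25.1, 31.6, 21.4, 25.9, 34.5, 41.0)  # 2002
y <- c(12.7, 13.9, 7.1, 3.7, 4.2, 3.8)      # 2022

# Squared standard error of each sample mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic: difference in means over the combined standard error
t_manual <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Compare with the built-in test
welch <- t.test(x, y)
c(manual = t_manual, builtin = unname(welch$statistic))
c(manual = df_manual, builtin = unname(welch$parameter))
```

Because the two samples are not assumed to share a variance, the degrees of freedom are generally not a whole number, which is why the output above reports values such as `df = 18805`.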

County Level

average_pm_by_county_02 <- combined_data[combined_data$year == 2002, ] %>%
group_by(COUNTY) %>%
summarize(
Average_PM_2002 = mean(PM2.5, na.rm = TRUE),
SD_PM_2002 = sd(PM2.5, na.rm = TRUE),
Year = mean(year),
Lat = mean(lat),
Lon = mean(lon))

average_pm_by_county_22 <- combined_data[combined_data$year == 2022, ] %>%
group_by(COUNTY) %>%
summarize(
Average_PM_2022 = mean(PM2.5, na.rm = TRUE),
SD_PM_2022 = sd(PM2.5, na.rm = TRUE),
Year = mean(year),
Lat = mean(lat),
Lon = mean(lon))

County_mean <- rbindlist(list(
average_pm_by_county_02,
average_pm_by_county_22), use.names = FALSE)  # bind by position: the year-specific column names differ
colnames(County_mean)[colnames(County_mean) == "Average_PM_2002"] <- "Average_PM"
colnames(County_mean)[colnames(County_mean) == "SD_PM_2002"] <- "SD_PM"
color_palette <- colorNumeric(
palette = "viridis",
domain = County_mean$Average_PM  # use the renamed column; Average_PM_2002 no longer exists here
)

temp.pal02 <- colorNumeric(c('darkgreen','goldenrod','brown'), domain = average_pm_by_county_02$Average_PM_2002)
                           
CountyPMmap02 <- leaflet(average_pm_by_county_02) %>%
addProviderTiles('CartoDB.Positron') %>%
addCircles(
lat = ~Lat, lng=~Lon,
label = ~paste0(round(average_pm_by_county_02$Average_PM_2002,2), ' PM2.5'),
color = ~temp.pal02(average_pm_by_county_02$Average_PM_2002),
opacity = 1, fillOpacity = 1, radius = 500
) %>%
addLegend('bottomleft', pal=temp.pal02, values=average_pm_by_county_02$Average_PM_2002,
title='Mean PM2.5 by County (2002)', opacity=1)
CountyPMmap02
temp.pal22 <- colorNumeric(c('darkgreen','goldenrod','brown'), domain = average_pm_by_county_22$Average_PM_2022)
                           
CountyPMmap22 <- leaflet(average_pm_by_county_22) %>%
addProviderTiles('CartoDB.Positron') %>%
addCircles(
lat = ~Lat, lng=~Lon,
label = ~paste0(round(average_pm_by_county_22$Average_PM_2022,2), ' PM2.5'),
color = ~temp.pal22(average_pm_by_county_22$Average_PM_2022),
opacity = 1, fillOpacity = 1, radius = 500
) %>%
addLegend('bottomleft', pal=temp.pal22, values=average_pm_by_county_22$Average_PM_2022,
title='Mean PM2.5 by County (2022)', opacity=1)
CountyPMmap22
library(ggplot2)

# Filter the data for 2002 and 2022
data_02 <- combined_data[combined_data$year == 2002, ]
data_22 <- combined_data[combined_data$year == 2022, ]

# Calculate average PM2.5 concentrations by county for both years
average_pm_by_county_2002 <- data_02 %>%
  group_by(COUNTY) %>%
  summarize(Average_PM2.5 = mean(PM2.5, na.rm = TRUE))

average_pm_by_county_2022 <- data_22 %>%
  group_by(COUNTY) %>%
  summarize(Average_PM2.5 = mean(PM2.5, na.rm = TRUE))

# Merge the two datasets for comparison
comparison_data <- merge(average_pm_by_county_2002, average_pm_by_county_2022, by = "COUNTY", suffixes = c("_2002", "_2022"))

# Reshape the data into long format
comparison_data_long <- pivot_longer(comparison_data, cols = starts_with("Average_PM2.5"), names_to = "Year", values_to = "Average_PM2.5")

# Create the side-by-side bar plot
ggplot(comparison_data_long, aes(x = COUNTY, y = Average_PM2.5, fill = Year)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.7), alpha = 0.7, width = 0.7) +
  labs(title = "Average PM2.5 Concentrations by County (2002 vs. 2022)",
       x = "County", y = "Average PM2.5 Concentration") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = c("blue", "red"), labels = c("2002", "2022")) +
  guides(fill = guide_legend(title = "Year"))

Looking at the maps and bar plot, we can see that average PM2.5 concentrations were generally higher in 2002 than in 2022 across counties.
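One way to put a number on "generally" is to compute the change per county and count how many counties decreased. A minimal sketch on made-up county means (the tables, county labels, and `mean_pm` column are hypothetical):

```r
# Toy county means (made up); county "C" has no 2022 data and "D" no 2002 data
county_2002 <- data.frame(COUNTY = c("A", "B", "C"), mean_pm = c(20, 15, 10))
county_2022 <- data.frame(COUNTY = c("A", "B", "D"), mean_pm = c(9, 16, 7))

# merge() keeps only counties present in both years (an inner join)
both <- merge(county_2002, county_2022, by = "COUNTY",
              suffixes = c("_2002", "_2022"))

# Change from 2002 to 2022; negative values mean PM2.5 went down
both$change <- both$mean_pm_2022 - both$mean_pm_2002
both

# Number of counties with a lower mean in 2022
n_decreased <- sum(both$change < 0)
n_decreased
```

Note that counties monitored in only one of the two years drop out of the merged table, so they are not part of the comparison.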

Site Level

# Filter the data for Los Angeles County
los_angeles_data <- combined_data %>%
  filter(COUNTY == "Los Angeles")

# Get unique site names in Los Angeles County
unique_site_names <- unique(los_angeles_data$`Site Name`)
unique_site_names
 [1] "Azusa"                          "Burbank"                       
 [3] "Los Angeles-North Main Street"  "Reseda"                        
 [5] "Lynwood"                        ""                              
 [7] "Pasadena"                       "Long Beach (North)"            
 [9] "Lancaster-Division Street"      "Lebec"                         
[11] "Glendora"                       "Compton"                       
[13] "Pico Rivera #2"                 "Long Beach (South)"            
[15] "Long Beach-Route 710 Near Road" "Signal Hill (LBSH)"            
[17] "North Hollywood (NOHO)"         "Santa Clarita"                 
# Filter the data for the selected site
Pasadena_data <- los_angeles_data %>%
  filter(`Site Name` == "Pasadena")

Pasadena_data %>%
  group_by(year) %>%
   summarize(
    Mean_PM2.5 = mean(PM2.5),
    SD_PM2.5 = sd(PM2.5),
    Q1_PM2.5 = quantile(PM2.5, 0.25),
    Median_PM2.5 = median(PM2.5),
    Q3_PM2.5 = quantile(PM2.5, 0.75)
  )
# A tibble: 2 × 6
   year Mean_PM2.5 SD_PM2.5 Q1_PM2.5 Median_PM2.5 Q3_PM2.5
  <dbl>      <dbl>    <dbl>    <dbl>        <dbl>    <dbl>
1  2002      20.3     11.1      12           17.8     25.2
2  2022       9.09     3.68      6.4          7.9     11.6
# Filter the data for Pasadena and the years 2002 and 2022
Pasadena_data_2002 <- Pasadena_data %>%
  filter(year == 2002)
Pasadena_data_2022 <- Pasadena_data %>%
  filter(year == 2022)

# Create a bar graph comparing PM2.5 in 2002 vs 2022
library(ggplot2)

ggplot() +
  # stat = "summary" with fun = "mean" plots the mean daily value; stat = "identity" would stack (sum) every daily reading
  geom_bar(data = Pasadena_data_2002, aes(x = "2002", y = PM2.5), stat = "summary", fun = "mean", fill = "red", width = 0.5) +
  geom_bar(data = Pasadena_data_2022, aes(x = "2022", y = PM2.5), stat = "summary", fun = "mean", fill = "blue", width = 0.5) +
  labs(title = "Mean PM2.5 Concentration in Pasadena (2002 vs 2022)", x = "Year", y = "Mean PM2.5 Concentration") +
  theme_minimal()

t_test_pasadena <- t.test(Pasadena_data_2002$PM2.5, Pasadena_data_2022$PM2.5)
t_test_pasadena

    Welch Two Sample t-test

data:  Pasadena_data_2002$PM2.5 and Pasadena_data_2022$PM2.5
t = 10.491, df = 146.06, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  9.087496 13.305989
sample estimates:
mean of x mean of y 
20.290909  9.094167 

Looking at the summary statistics and bar plot, we can see that the average daily PM2.5 concentration in Pasadena was higher in 2002 than in 2022. The t-test shows that this difference in average daily PM2.5 concentration is statistically significant (p < 0.001).